Crawling Deep Web Content through Query Forms
Abstract
This paper proposes the concept of the Minimum Executable Pattern (MEP) and presents an MEP generation method and an MEP-based adaptive query method for the Deep Web. The query method extends the query interface from a single text box to a set of MEPs, and generates a locally optimal query by choosing an MEP and a keyword vector for that MEP. Our method mitigates, to a certain extent, the “data islands” problem that results from deficiencies in current methods. Experimental results on six real-world Deep Web sites show that our method outperforms existing methods in terms of query capability and applicability.
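The selection step described in the abstract (pick one pattern and one keyword vector per round) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the representation of a pattern as a tuple of form-field names, the per-field vocabulary, and the `estimate_gain` scoring callback are all assumptions introduced here for illustration, since the abstract does not specify how candidates are scored.

```python
from itertools import product
from typing import Callable, Dict, Iterator, List, Optional, Set, Tuple

# Hypothetical representation: a pattern is a tuple of form-field names,
# and a query pairs a pattern with one keyword per field.
Pattern = Tuple[str, ...]
Query = Tuple[Pattern, Tuple[str, ...]]


def candidate_queries(patterns: List[Pattern],
                      vocab: Dict[str, List[str]]) -> Iterator[Query]:
    """Enumerate every (pattern, keyword vector) pair the vocabulary allows."""
    for pattern in patterns:
        pools = [vocab.get(field, []) for field in pattern]
        for keywords in product(*pools):
            yield pattern, keywords


def choose_query(patterns: List[Pattern],
                 vocab: Dict[str, List[str]],
                 estimate_gain: Callable[[Query], float],
                 issued: Set[Query]) -> Optional[Query]:
    """Greedy step: return the unissued query with the highest estimated gain.

    estimate_gain is an assumed scoring function (for example, expected new
    records per request); None is returned when no candidate scores above 0.
    """
    best, best_gain = None, 0.0
    for query in candidate_queries(patterns, vocab):
        if query in issued:
            continue
        gain = estimate_gain(query)
        if gain > best_gain:
            best, best_gain = query, gain
    return best
```

A crawler loop would call `choose_query` repeatedly, add the returned query to `issued`, submit it, and update whatever statistics `estimate_gain` relies on. The choice is locally optimal only in the sense that it maximizes the estimate for the current round.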
Similar Resources
A Structure-Driven Yield-Aware Web Form Crawler: Building a Database of Online Databases
The Web has been rapidly “deepened” by massive databases online: Recent surveys show that while the surface Web has linked billions of static HTML pages, a far more significant amount of information is “hidden” in the deep Web, behind the query forms of searchable databases. With its myriad databases and hidden content, this deep Web is an important frontier for information search. In this pape...
Learning to Surface Deep Web Content
We propose a novel deep web crawling framework based on reinforcement learning. The crawler is regarded as an agent and deep web database as the environment. The agent perceives its current state and submits a selected action (query) to the environment according to Q-value. Based on the framework we develop an adaptive crawling method. Experimental results show that it outperforms the state of ...
Efficient Deep Web Crawling Using Reinforcement Learning
The Deep Web refers to the hidden part of the Web that remains unavailable to standard Web crawlers. Obtaining Deep Web content is challenging and has been acknowledged as a significant gap in the coverage of search engines. To this end, the paper proposes a novel deep web crawling framework based on reinforcement learning, in which the crawler is regarded as an agent and deep web database as t...
A Novel Approach to Integrated Search Information Retrieval Technique for Hidden Web for Domain Specific Crawling
Traditional web crawlers retrieve content only from the “Surface web” and cannot crawl the hidden portion of the Web, which contains high-quality information generated dynamically by databases when queries are submitted through a search interface. For Hidden web, most of the published research has been done to identify/detect such searchable forms and m...
A Task-specific Approach for Crawling the Deep Web
A great amount of valuable information on the web cannot be accessed by conventional crawler engines. This portion of the web is usually known as the Deep Web or the Hidden Web. Most probably, the highest-value information in the deep web is that behind web forms. In this paper, we describe a prototype hidden-web crawler able to access such content. Our approach is b...